Project Title: Predictors of Major League Baseball Player Injury in 2017
Author Name: Lacey Gleason
Email Address: lpgleas@emory.edu
library(readxl)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(tm)
## Loading required package: NLP
library(NLP)
# Import transaction data from MLB.com
# Downloaded as Excel file - includes all transactions for all months in 2017
transactions <- read_excel("Transactions.xlsx",
sheet = "Sheet1")
head(transactions)
## # A tibble: 6 x 2
## Date Transaction
## <dttm> <chr>
## 1 2017-01-02 00:00:00 Miami Marlins signed free agent OF Mark Traylor to …
## 2 2017-01-02 00:00:00 Cincinnati Reds signed free agent RHP Geoff Broussa…
## 3 2017-01-02 00:00:00 Cincinnati Reds signed free agent RHP Deunte Heath …
## 4 2017-01-02 00:00:00 Cincinnati Reds signed free agent C Adrian Nieto to…
## 5 2017-01-02 00:00:00 Washington Nationals signed free agent LHP Stone Sp…
## 6 2017-01-03 00:00:00 Miami Marlins signed free agent 1B Tyler Moore to a…
There were 11,727 MLB transactions during the 12 months of 2017. The imported dataset contains a column of transaction dates and a column of text that describes the transactions. Within this text is the information about team, player name, player position, type of transaction, and type of injury (where applicable for disabled list transactions) that we want to pull out.
First, we need to filter just for disabled list transactions.
# Filter just for those transactions that contain the term 'disabled list' - call dataframe new
new <- dplyr::filter(transactions, grepl('disabled list', Transaction))
head(new)
## # A tibble: 6 x 2
## Date Transaction
## <dttm> <chr>
## 1 2017-02-14 00:00:00 Cincinnati Reds placed RHP Homer Bailey on the 60-d…
## 2 2017-02-14 00:00:00 Texas Rangers placed DH Prince Fielder on the 60-da…
## 3 2017-02-14 00:00:00 Texas Rangers placed LHP Jake Diekman on the 60-day…
## 4 2017-02-14 00:00:00 Atlanta Braves placed LHP Jacob Lindgren on the 60-…
## 5 2017-02-15 00:00:00 Kansas City Royals placed LHP Brian Flynn on the 60…
## 6 2017-02-15 00:00:00 Los Angeles Dodgers placed RHP Yimi Garcia on the 6…
There were 1,386 MLB transactions in 2017 that involved the disabled list. We would like to know what types of transactions those were.
# create a new variable for action to reflect what kind of DL action it is
## Placed means first instance of player being assigned to DL
## Transferred means player was moved from shorter DL list to longer DL list
## Activated means player was put back on active roster from DL
new$action <- ifelse(grepl("transfer", new$Transaction, ignore.case = T), "Transferred",
ifelse(grepl("place", new$Transaction, ignore.case = T), "Placed", ifelse(grepl("activate", new$Transaction, ignore.case =T), "Activated", "Other")))
#make a frequency table of action
act_count <- plyr::count(new, 'action')
act_count
## action freq
## 1 Activated 543
## 2 Placed 710
## 3 Transferred 133
The proportion of disalbed list actions that were activations was 39.8%. The proportion of disabled list actions that were placements was 52.0%. The proportion of disabled list actions that were transfers was 9.7%. The percentages do not add up to 100% due to rounding. Since there were no “Other” actions in frequency table, we know that we’re not missing a DL action other than those listed.
Next, we’d like to pull out player characteristics from the text.
# Create variable for position of player
new$position <- ifelse(grepl(" C ", new$Transaction, ignore.case = F), "Catcher",
ifelse(grepl("1B", new$Transaction, ignore.case = F), "First Baseman", ifelse(grepl("2B", new$Transaction, ignore.case =F), "Second Baseman", ifelse(grepl("3B", new$Transaction, ignore.case =F), "Third Baseman", ifelse(grepl("SS", new$Transaction, ignore.case =F), "Shortstop", ifelse(grepl("RF", new$Transaction, ignore.case =F), "Right Fielder", ifelse(grepl("CF", new$Transaction, ignore.case =F), "Center Fielder", ifelse(grepl("LF", new$Transaction, ignore.case =F), "Left Fielder", ifelse(grepl("RHP", new$Transaction, ignore.case =F), "Pitcher", ifelse(grepl("LHP", new$Transaction, ignore.case =F), "Pitcher", ifelse(grepl(" P ", new$Transaction, ignore.case =F), "Pitcher", ifelse(grepl("DH", new$Transaction, ignore.case =F), "Designated Hitter", ifelse(grepl("OF", new$Transaction, ignore.case =F), "Outfielder", "Other")))))))))))))
# Create frequency table of position
pos_count <- plyr::count(new, 'position')
pos_count
## position freq
## 1 Center Fielder 91
## 2 Designated Hitter 15
## 3 First Baseman 50
## 4 Left Fielder 105
## 5 Other 95
## 6 Outfielder 6
## 7 Pitcher 763
## 8 Right Fielder 51
## 9 Second Baseman 66
## 10 Shortstop 66
## 11 Third Baseman 78
We see that 55% of the players involved in DL transactions were pitchers. Later, after we de-duplicate the placements, transfers, and activations, we can compare the player type distribution to that of the active players in the league overall in 2017.
Next, we will look at which type of disabled list is utilized in most disabled list transactions. There is a 10-day disabled list, a 60-day disabled list, and a special 7-day disabled list for concussions. There have been some interesting changes to the length of the disabled list over the years (e.g., elimination of 15-day disabled list and re-introduction of 10-day disabled list) to try to nudge teams and players to make the best choices in weighing a player’s health and a team’s need for the player’s on-field contributions. Timing of these policy changes will be kept in mind during this analysis.
# Create a new variable called list that refelcts type of DL referenced in transaction
new$list <- ifelse(grepl("10-day", new$Transaction, ignore.case = T), "10-day",
ifelse(grepl("60-day", new$Transaction, ignore.case = T), "60-day", ifelse(grepl("7-day", new$Transaction, ignore.case = T), "7-day", "Other")))
#make a frequency table of type of disabled list
list_count <- plyr::count(new, 'list')
list_count
## list freq
## 1 10-day 1181
## 2 60-day 168
## 3 7-day 37
attach(new)
list_table <- table(list,action) # A will be rows, B will be columns
list_table # print table
## action
## list Activated Placed Transferred
## 10-day 408 640 133
## 60-day 124 44 0
## 7-day 11 26 0
prop.table(list_table, 2) # column percentages
## action
## list Activated Placed Transferred
## 10-day 0.75138122 0.90140845 1.00000000
## 60-day 0.22836096 0.06197183 0.00000000
## 7-day 0.02025783 0.03661972 0.00000000
We see that 85% of the disabled list transactions (90% of placements) involved actions related to the 10-day disabled list. Twelve percent of the DL transactions (6% of placements) involved the 60-day disabled list and 3% of the DL transactions (4% of placements) involved the 7-day disabled list. All of the transfers involved transfers from the 10-day disabled list.
Another important characteristic of DL placements is whether they are retroactive to an earlier date. If a player has not played in up to 5 days, the team can then retroactively place the player on the disabled list and have those days count towards meeting the length requirement of the disabled list. Since teams are protected by being able to retroactively place players on the DL, they tend to list players as day-to-day in case they get better within five days and can avoid having to serve a full ten days on the DL.
# Create new variable to indicate if this DL transaction is a retroactive move?
new$retro <- ifelse(grepl("retroactive", new$Transaction, ignore.case = T), "Yes", "No")
#make a frequency table of action
retro_count <- plyr::count(new, 'retro')
retro_count
## retro freq
## 1 No 1078
## 2 Yes 308
attach(new)
## The following objects are masked from new (pos = 3):
##
## action, Date, list, position, Transaction
retro_table <- table(retro,action) # A will be rows, B will be columns
retro_table # print table
## action
## retro Activated Placed Transferred
## No 543 404 131
## Yes 0 306 2
prop.table(retro_table, 2) # column percentages
## action
## retro Activated Placed Transferred
## No 1.00000000 0.56901408 0.98496241
## Yes 0.00000000 0.43098592 0.01503759
Examining the table above, we see that 22% of the DL transactions are retroactive. This represents 43% of DL placements in 2017.
Another interesting component that we want to pull out of this transaction text is information about the injuries precipitating DL transactions. Below, we will look at if surgeries are involved.
# Create a variable to indicate if surgery is indicated in DL transaction description
new$surg <- ifelse(grepl("surgery", new$Transaction, ignore.case = T), "Yes", "No")
#make a frequency table of surgery
surg_count <- plyr::count(new, 'surg')
surg_count
## surg freq
## 1 No 1357
## 2 Yes 29
attach(new)
## The following objects are masked from new (pos = 3):
##
## action, Date, list, position, retro, Transaction
## The following objects are masked from new (pos = 4):
##
## action, Date, list, position, Transaction
surg_table <- table(surg,action) # A will be rows, B will be columns
surg_table # print table
## action
## surg Activated Placed Transferred
## No 543 687 127
## Yes 0 23 6
prop.table(surg_table, 2) # column percentages
## action
## surg Activated Placed Transferred
## No 1.00000000 0.96760563 0.95488722
## Yes 0.00000000 0.03239437 0.04511278
There were 29 disabled list transactions in 2017 that mentioned surgery. Surgeries were involved in 23 (3%) of DL placements.
Let’s look at concussions.
# Create variable for concussion involvement
new$concuss <- ifelse(grepl("concussion", new$Transaction, ignore.case = T), "1", "0")
#make a frequency table of action
concuss_count <- plyr::count(new, 'concuss')
concuss_count
## concuss freq
## 1 0 1342
## 2 1 44
attach(new)
## The following objects are masked from new (pos = 3):
##
## action, Date, list, position, retro, surg, Transaction
## The following objects are masked from new (pos = 4):
##
## action, Date, list, position, retro, Transaction
## The following objects are masked from new (pos = 5):
##
## action, Date, list, position, Transaction
con_table <- table(concuss,action) # A will be rows, B will be columns
con_table # print table
## action
## concuss Activated Placed Transferred
## 0 543 670 129
## 1 0 40 4
prop.table(con_table, 2) # column percentages
## action
## concuss Activated Placed Transferred
## 0 1.00000000 0.94366197 0.96992481
## 1 0.00000000 0.05633803 0.03007519
There were 44 disabled list transactions in 2017 that mentioned concussions. Forty (6%) of DL placements involved concussions. This is around the number we would expect since we saw there were 37 instances where the 7-day disabled list was used, which is specfically for concussions.
Next, we will look at which side of the body is involved in the injury.
# Create new variable to indicate side of body of injury
new$side <- ifelse(grepl("right", new$Transaction, ignore.case = T), "Right", ifelse(grepl("left", new$Transaction, ignore.case = T), "Left", "Unknown"))
#make a frequency table of action
side_count <- plyr::count(new, 'side')
side_count
## side freq
## 1 Left 240
## 2 Right 385
## 3 Unknown 761
attach(new)
## The following objects are masked from new (pos = 3):
##
## action, concuss, Date, list, position, retro, surg,
## Transaction
## The following objects are masked from new (pos = 4):
##
## action, Date, list, position, retro, surg, Transaction
## The following objects are masked from new (pos = 5):
##
## action, Date, list, position, retro, Transaction
## The following objects are masked from new (pos = 6):
##
## action, Date, list, position, Transaction
side_table <- table(side,action) # A will be rows, B will be columns
side_table # print table
## action
## side Activated Placed Transferred
## Left 0 210 30
## Right 4 311 70
## Unknown 539 189 33
prop.table(side_table, 2) # column percentages
## action
## side Activated Placed Transferred
## Left 0.000000000 0.295774648 0.225563910
## Right 0.007366483 0.438028169 0.526315789
## Unknown 0.992633517 0.266197183 0.248120301
For those DL placements that included the side of the body involved in the injury, 60% involved the right side of the body. However, 27% of all DL placements did not list side of the body.
Next, let’s look at DL transactions that mentioned Tommy John surgery, a common procedure for pitchers.
# Tommy John Surgery indicator
new$tom <- ifelse(grepl("Tommy John", new$Transaction, ignore.case = T), "1", ifelse(grepl(" UCL ", new$Transaction, ignore.case = T), "1", ifelse(grepl("ulnar collateral", new$Transaction, ignore.case = T), "1", "0")))
#make a frequency table of action
tom_count <- plyr::count(new, 'tom')
tom_count
## tom freq
## 1 0 1368
## 2 1 18
attach(new)
## The following objects are masked from new (pos = 3):
##
## action, concuss, Date, list, position, retro, side, surg,
## Transaction
## The following objects are masked from new (pos = 4):
##
## action, concuss, Date, list, position, retro, surg,
## Transaction
## The following objects are masked from new (pos = 5):
##
## action, Date, list, position, retro, surg, Transaction
## The following objects are masked from new (pos = 6):
##
## action, Date, list, position, retro, Transaction
## The following objects are masked from new (pos = 7):
##
## action, Date, list, position, Transaction
tom_table <- table(tom,action) # A will be rows, B will be columns
tom_table # print table
## action
## tom Activated Placed Transferred
## 0 543 698 127
## 1 0 12 6
prop.table(tom_table, 2) # column percentages
## action
## tom Activated Placed Transferred
## 0 1.00000000 0.98309859 0.95488722
## 1 0.00000000 0.01690141 0.04511278
Overall, 1.7% of DL placements and 4.5% of DL transfers involve Tommy John surgery.
# create variable for month of transaction - later it will be useful to have this
new$month <- format(new$Date,"%B")
Next, let’s see if we can make a wordcloud of the disabled list descriptions.
DL_lines <- grep("disabled list", transactions$Transaction, value = TRUE)
length(DL_lines)
## [1] 1386
head(DL_lines)
## [1] "Cincinnati Reds placed RHP Homer Bailey on the 60-day disabled list. Right elbow surgery."
## [2] "Texas Rangers placed DH Prince Fielder on the 60-day disabled list. Disc herniation in neck."
## [3] "Texas Rangers placed LHP Jake Diekman on the 60-day disabled list. Colon surgery."
## [4] "Atlanta Braves placed LHP Jacob Lindgren on the 60-day disabled list. Recovery from Tommy John surgery."
## [5] "Kansas City Royals placed LHP Brian Flynn on the 60-day disabled list. Stable lumbar vertebral fracture."
## [6] "Los Angeles Dodgers placed RHP Yimi Garcia on the 60-day disabled list. Recovering from Tommy John Surgery"
# Make wordcloud of words used in disabled list transactions
library(wordcloud)
library(RColorBrewer)
library(tm)
library(NLP)
wordcloud(VCorpus(VectorSource(DL_lines)), max.words = 15, scale =c(5.5,.4), colors = topo.colors(n=30), random.color = TRUE)
#Get rid of common words so that word cloud is more interesting
# took out common words like "the" and also word that are part of multiple teams' names
#(e.g., Los, York, Chicago)
pattern <- "disabled"
DL_lines2 <- sub(pattern, "", DL_lines)
tail(DL_lines2)
## [1] "Los Angeles Dodgers activated LF Andrew Toles from the 60-day list."
## [2] "Philadelphia Phillies activated RHP Zach Eflin from the 60-day list."
## [3] "Philadelphia Phillies activated RHP Jerad Eickhoff from the 60-day list."
## [4] "Philadelphia Phillies activated RHP Vince Velasquez from the 60-day list."
## [5] "Pittsburgh Pirates activated 2B Josh Harrison from the 60-day list."
## [6] "St. Louis Cardinals activated RHP Alex Reyes from the 60-day list."
DL_lines2[200:220]
## [1] "San Francisco Giants placed LHP Madison Bumgarner on the 10-day list. Bruised ribs and sprained left shoulder."
## [2] "Cleveland Indians activated 2B Jason Kipnis from the 10-day list."
## [3] "Detroit Tigers placed SS Jose Iglesias on the 7-day list retroactive to April 20, 2017. Concussion."
## [4] "Los Angeles Angels placed RHP Mike Morin on the 10-day list retroactive to April 20, 2017. Neck tightness."
## [5] "Chicago White Sox placed RHP James Shields on the 10-day list retroactive to April 18, 2017. Strained right lat."
## [6] "Tampa Bay Rays placed LHP Xavier Cedeno on the 10-day list retroactive to April 18, 2017. Left forearm tightness."
## [7] "Texas Rangers placed RHP A.J. Griffin on the 10-day list retroactive to April 18, 2017. Gout in left ankle."
## [8] "Detroit Tigers placed 1B Miguel Cabrera on the 10-day list. Right groin strain."
## [9] "Chicago White Sox activated C Geovany Soto from the 10-day list."
## [10] "Chicago White Sox transferred CF Charlie Tilson from the 10-day list to the 60-day disabled list. Stress reaction in right foot."
## [11] "Philadelphia Phillies transferred RHP Clay Buchholz from the 10-day list to the 60-day disabled list. Torn flexor tendon in right elbow."
## [12] "Los Angeles Angels transferred RHP Garrett Richards from the 10-day list to the 60-day disabled list. Right biceps strain."
## [13] "Los Angeles Angels placed RHP Cam Bedrosian on the 10-day list. Right groin strain."
## [14] "Minnesota Twins placed RHP Justin Haley on the 10-day list. Right bicep tendinitis."
## [15] "Tampa Bay Rays placed RHP Tommy Hunter on the 10-day list. Right calf strain."
## [16] "San Diego Padres activated RHP Luis Perdomo from the 10-day list."
## [17] "Oakland Athletics transferred RHP Chris Bassitt from the 10-day list to the 60-day disabled list. Recovering from Tommy John surgery."
## [18] "Toronto Blue Jays transferred CF Dalton Pompey from the 10-day list to the 60-day disabled list. Concussion."
## [19] "Toronto Blue Jays placed SS Troy Tulowitzki on the 10-day list. Strained right hamstring."
## [20] "San Francisco Giants placed CF Denard Span on the 10-day list."
## [21] "Philadelphia Phillies placed RHP Aaron Nola on the 10-day list retroactive to April 21, 2017. Lower back strain."
pattern2 <- "list"
DL_lines2 <- sub(pattern2, "", DL_lines2)
pattern3 <- "Chicago"
DL_lines2 <- sub(pattern3, "", DL_lines2)
pattern4 <- "San"
DL_lines2 <- sub(pattern4, "", DL_lines2)
pattern5 <- "Los"
DL_lines2 <- sub(pattern5, "", DL_lines2)
pattern6 <- "New"
DL_lines2 <- sub(pattern6, "", DL_lines2)
pattern7 <- " on"
DL_lines2 <- sub(pattern7, "", DL_lines2)
pattern8 <- " the"
DL_lines2 <- sub(pattern8, "", DL_lines2)
pattern9 <- "from"
DL_lines2 <- sub(pattern9, "", DL_lines2)
pattern10 <- " The "
DL_lines2 <- sub(pattern10, "", DL_lines2)
pattern11 <- "Angeles"
DL_lines2 <- sub(pattern11, "", DL_lines2)
pattern12 <- "list."
DL_lines2 <- sub(pattern12, "", DL_lines2)
pattern13 <- "2017"
DL_lines2 <- sub(pattern13, "", DL_lines2)
pattern14 <- "York"
DL_lines2 <- sub(pattern14, "", DL_lines2)
pattern15 <- "Red"
DL_lines2 <- sub(pattern15, "", DL_lines2)
pattern16 <- "Blue"
DL_lines2 <- sub(pattern16, "", DL_lines2)
pattern17 <- "Bay"
DL_lines2 <- sub(pattern17, "", DL_lines2)
pattern18 <- "to the"
DL_lines2 <- sub(pattern18, "", DL_lines2)
pattern19 <- " and"
DL_lines2 <- sub(pattern19, "", DL_lines2)
pattern20 <- "Disabled"
DL_lines2 <- sub(pattern20, "", DL_lines2)
pattern21 <- "Sox"
DL_lines2 <- sub(pattern21, "", DL_lines2)
wordcloud(VCorpus(VectorSource(DL_lines2)), max.words = 15, scale =c(5.5,.4), colors = topo.colors(n=30), random.color = TRUE)
Run logistic regression on 20% sample of data to train
Use model to predict outcome for the remaining 80% of data
| Week | Task |
|---|---|
| March 26 | Finish text analysis of transactio data and merge Baseball Reference data on player characteristics of interest and add to disabled list data matching by player name, position, and team |
| April 2 | Create Table 1 for descriptive statistics, create tables/graphs for model results |
| April 2 | Write first draft of introduction and results sections of manuscript |
| April 9 | Write discussion/conclusions section of manuscript |
| April 16 | Submit manuscript |
| April 23 | Submit presentation |
The link for this file is located at https://github.com/lpgleason/Project.git.